In [1]:
%%HTML
<link rel="stylesheet" type="text/css" href="css/custom.css">


Demystifying pandas internals

Marc Garcia - @datapythonista

About me

Marc Garcia

More info at my personal page and LinkedIn profile

Source code (jupyter notebook) of these slides:

Overview

  • Quick pandas overview
  • What can go wrong?
  • Why pandas? Why not pure Python?
  • Data types
  • Computer architecture
  • pandas internal structure
  • Memory copy
  • Future of pandas

Quick pandas overview: Titanic


In [2]:
import pandas

titanic = pandas.read_csv('data/titanic.csv')
titanic.head()


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [3]:
# basic statistics

titanic[['Age', 'Fare']].describe()


Out[3]:
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200

In [4]:
# percentage of missing values

titanic[['Age', 'Fare']].isna().mean()


Out[4]:
Age     0.198653
Fare    0.000000
dtype: float64

In [5]:
# imputing missing values with the median

titanic['Age'].fillna(titanic['Age'].median(), inplace=True)
titanic[['Age', 'Fare']].isna().mean()


Out[5]:
Age     0.0
Fare    0.0
dtype: float64

In [6]:
# plot distributions

titanic[['Age', 'Fare']].boxplot();



In [7]:
# dropping rows with 5% higher fares

titanic = titanic[titanic['Fare'] < titanic['Fare'].quantile(.95)]
titanic[['Age', 'Fare']].boxplot();



In [8]:
# plotting correlation

titanic.plot(kind='scatter', x='Age', y='Fare');



In [9]:
# compute correlation

titanic[['Age', 'Fare']].corr()


Out[9]:
Age Fare
Age 1.000000 0.145508
Fare 0.145508 1.000000

In [10]:
# get dummies for categories

dummies = pandas.get_dummies(titanic[['Sex', 'Embarked']])
dummies.head()


Out[10]:
Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 0 1 0 0 1
1 1 0 1 0 0
2 1 0 0 0 1
3 1 0 0 0 1
4 0 1 0 0 1

In [11]:
# prepare data for machine learning models

x = titanic[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].join(dummies)
y = titanic['Survived']
x.head()


Out[11]:
Pclass Age SibSp Parch Fare Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 3 22.0 1 0 7.2500 0 1 0 0 1
1 1 38.0 1 0 71.2833 1 0 1 0 0
2 3 26.0 0 0 7.9250 1 0 0 0 1
3 1 35.0 1 0 53.1000 1 0 0 0 1
4 3 35.0 0 0 8.0500 0 1 0 0 1

In [12]:
import sklearn.ensemble

clf = sklearn.ensemble.RandomForestClassifier()
clf.fit(x, y)
pandas.Series(clf.feature_importances_, index=x.columns).sort_values(ascending=False)


Out[12]:
Age           0.266413
Fare          0.251490
Sex_male      0.226070
Pclass        0.087555
SibSp         0.056010
Sex_female    0.050103
Parch         0.033255
Embarked_S    0.014759
Embarked_Q    0.007314
Embarked_C    0.007031
dtype: float64

In [13]:
titanic.plot(kind='scatter', x='Age', y='Fare', color=y.replace({0: 'red', 1: 'blue'}));


This is pandas, and it is super cool

What can go wrong?

And why should be care about how pandas works internally?


In [14]:
# some code resembling pandas to make the traceback look nice
import numpy

cols_from_csv = rows_from_csv = lambda fname: 10_000_000

def read_csv(fname, *args, **kwargs):
    cols = cols_from_csv(fname)
    rows = rows_from_csv(fname)
    create_block_manager_from_arrays(rows, cols)

def create_block_manager_from_arrays(rows, cols):
    float_blocks = _multi_blockify((rows, cols), numpy.float64)

def _multi_blockify(shape, dtype):
    _stack_arrays(shape, dtype)

def _stack_arrays(shape, dtype):
    stacked = numpy.empty(shape, dtype)

original_read_csv = pandas.read_csv
pandas.read_csv = read_csv

In [15]:
import pandas

df = pandas.read_csv('data/huge_file.csv')


---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-15-9c0c311f0922> in <module>()
      1 import pandas
      2 
----> 3 df = pandas.read_csv('data/huge_file.csv')

<ipython-input-14-548380dd66ac> in read_csv(fname, *args, **kwargs)
      7     cols = cols_from_csv(fname)
      8     rows = rows_from_csv(fname)
----> 9     create_block_manager_from_arrays(rows, cols)
     10 
     11 def create_block_manager_from_arrays(rows, cols):

<ipython-input-14-548380dd66ac> in create_block_manager_from_arrays(rows, cols)
     10 
     11 def create_block_manager_from_arrays(rows, cols):
---> 12     float_blocks = _multi_blockify((rows, cols), numpy.float64)
     13 
     14 def _multi_blockify(shape, dtype):

<ipython-input-14-548380dd66ac> in _multi_blockify(shape, dtype)
     13 
     14 def _multi_blockify(shape, dtype):
---> 15     _stack_arrays(shape, dtype)
     16 
     17 def _stack_arrays(shape, dtype):

<ipython-input-14-548380dd66ac> in _stack_arrays(shape, dtype)
     16 
     17 def _stack_arrays(shape, dtype):
---> 18     stacked = numpy.empty(shape, dtype)
     19 
     20 original_read_csv = pandas.read_csv

MemoryError: 

In [16]:
pandas.read_csv = original_read_csv

In [17]:
original_read_csv = pandas.read_csv

class DataFrame:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def sum():
        import IPython.display
        img = IPython.display.Image(filename='img/waiting.jpg')
        IPython.display.display(img)

pandas.read_csv = DataFrame

In [18]:
import pandas

df = pandas.read_csv('data/huge_file.csv')
total = df.sum()



In [19]:
pandas.read_csv = original_read_csv

In [20]:
import pandas

titanic = pandas.read_csv('data/titanic.csv')
titanic_male = titanic[titanic['Sex'] == 'male']
titanic_male['Age'].fillna(titanic_male['Age'].median(), inplace=True)


/home/mgarcia/.anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4355: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)

In [21]:
import warnings
import pandas

warnings.filterwarnings('ignore')

titanic = pandas.read_csv('data/titanic.csv')
titanic_male = titanic[titanic['Sex'] == 'male']
titanic_male['Age'].fillna(titanic_male['Age'].median(), inplace=True)

...you know you did something wrong

Why pandas? Why not pure Python?


In [22]:
# pandas.read_csv('data/titanic.csv')
import csv

data = []
with open('data/titanic.csv') as f:
    reader = csv.reader(f)
    columns = next(reader)
    for row in reader:
        data.append(row)

In [23]:
columns = ('Name', 'Sex', 'Age', 'Fare', 'Survived')

data = [('Montvila, Rev. Juozas',                    'male',     27.,    13., 0),
        ('Graham, Miss. Margaret Edith',             'female',   19.,    30., 1),
        ('Johnston, Miss. Catherine Helen "Carrie"', 'female',  None,  23.45, 0),
        ('Behr, Mr. Karl Howell',                    'male',     26.,    30., 1),
        ('Dooley, Mr. Patrick',                      'male',     32.,   7.75, 0)]

In [24]:
# df.iloc[3]
data[3]


Out[24]:
('Behr, Mr. Karl Howell', 'male', 26.0, 30.0, 1)

In [25]:
# df['Fare']
fares = list(zip(*data))[columns.index('Fare')]
fares


Out[25]:
(13.0, 30.0, 23.45, 30.0, 7.75)

In [26]:
# df['Fare'].mean()
mean_fare = sum(fares) / len(fares)
mean_fare


Out[26]:
20.84

In [27]:
# df[df.Sex == 'male']
male = list(filter(lambda x: x[columns.index('Sex')] == 'male', data))
male


Out[27]:
[('Montvila, Rev. Juozas', 'male', 27.0, 13.0, 0),
 ('Behr, Mr. Karl Howell', 'male', 26.0, 30.0, 1),
 ('Dooley, Mr. Patrick', 'male', 32.0, 7.75, 0)]

Why pandas? Why not pure Python?

  • Easy and powerful indexing
  • Management of missing values
  • High-level functionality:
    • Statistics
    • Time series
    • Groupping
    • ...
  • Performance

Some computer engineering background

  • Data types
  • Computer architecture

Data types

unsigned integer 4 bits:

Binary Decimal Binary Decimal
0000 0 1000 8
0001 1 1001 9
0010 2 1010 10
0011 3 1011 11
0100 4 1100 12
0101 5 1101 13
0110 6 1110 14
0111 7 1111 15

Data types

signed integer 4 bits (two's complement):

Binary Decimal Binary Decimal
0000 0 1000 -8
0001 1 1001 -7
0010 2 1010 -6
0011 3 1011 -5
0100 4 1100 -4
0101 5 1101 -3
0110 6 1110 -2
0111 7 1111 -1

Data types

Decimal numbers:

  • Precision
  • Exponent

$value = precision \cdot 2^{exponent}$

$1328.543 = 1.328543 \cdot 10^{3}$

Data types

floating point 4 bits (2 bits exponent, 2 bits precision):

Binary Exponent Precision Decimal Binary Exponent Precision Decimal
00-00 0 1.00 1.00 10-00 -2 1.00 0.010
00-01 0 1.50 1.50 10-01 -2 1.50 0.015
00-10 0 1.25 1.25 10-10 -2 1.25 0.013
00-11 0 1.75 1.75 10-11 -2 1.75 0.018
01-00 1 0 +INF 11-00 -1 1.00 0.100
01-01 1 1 0 11-01 -1 1.50 0.150
01-10 1 2 NaN 11-10 -1 1.25 0.125
01-11 1 3 -INF 11-11 -1 1.75 0.175

Not all numbers can be represented


In [28]:
0.1 + 0.2


Out[28]:
0.30000000000000004

In [29]:
import numpy

numpy.array([256], dtype=numpy.uint8)


Out[29]:
array([0], dtype=uint8)

Special numbers exist in floating point but not integer representation: inf, -inf, NaN


In [30]:
import pandas

s = pandas.Series([1, 2])
print(s)


0    1
1    2
dtype: int64

In [31]:
s.loc[0] = float('NaN')
print(s)


0    NaN
1    2.0
dtype: float64

Example 1: Comparing list items, int vs float


In [55]:
import random

size = 10_000_000

list1 = [random.random() for i in range(size)]
list2 = [random.random() for i in range(size)]

In [56]:
import numpy
import pandas

series1 = pandas.Series(list1, dtype=numpy.uint8)
series2 = pandas.Series(list2, dtype=numpy.uint8)

print('Memory consumed: {:.2f} Mb'.format(series1.memory_usage(index=False) / 1024 / 1024))

%timeit (series1 > series2).mean()


Memory consumed: 9.54 Mb
16.6 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [57]:
import numpy
import pandas

series1 = pandas.Series(list1, dtype=numpy.float64)
series2 = pandas.Series(list2, dtype=numpy.float64)

print('Memory consumed: {:.2f} Mb'.format(series1.memory_usage(index=False) / 1024 / 1024))

%timeit (series1 > series2).mean()


Memory consumed: 76.29 Mb
72.1 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Computer architecture

Machine code

Address Binary Decimal (uint8)
0320 00001111 25
0321 00001010 41
0322 00000010 2
1) LOAD 0320 -> R1   (R1=25, R2= ?)
2) LOAD 0321 -> R2   (R1=25, R2=41)
3) ADD R1 R2 -> R2   (R1=25, R2=66)
4) LOAD 0322 -> R1   (R1= 2, R2=66)
5) ADD R1 R2 -> R2   (R1= 2, R2=68)

Machine code

1) LOAD 0320 -> R1   (R1=25, R2= ?)
2) LOAD 0321 -> R2   (R1=25, R2=41)  <- Preload to the CPU cache at 1)
3) ADD R1 R2 -> R2   (R1=25, R2=66)
4) LOAD 0322 -> R1   (R1= 2, R2=66)  <- Preload to the CPU cache at 1)
5) ADD R1 R2 -> R2   (R1= 2, R2=68)

Loading from the memory is slow:

  • CPU cache

Example 2: Summing list elements


In [36]:
import random
import functools
import operator

size = 10_000_000
list1 = [random.random() for i in range(size)]

%timeit functools.reduce(operator.add, list1)


886 ms ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]:
import pandas

s1 = pandas.Series(list1)

%timeit s1.sum()


80.2 ms ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Example 3: Comparing list items, Python vs pandas


In [38]:
%timeit sum(item1 > item2 for item1, item2 in zip(list1, list2)) / size


1.64 s ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [39]:
import pandas

series1 = pandas.Series(list1)
series2 = pandas.Series(list2)

%timeit (series1 > series2).mean()


71.1 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

numpy and pandas implementation are somehow similar to the machine code we've seen.

What about the pure Python implementation?

Python is implemented with great ideas in mind:

  • Readability
  • Flexibility
  • Hassle free
  • No need to care about types

But not fast numerical computations


In [40]:
data = [1, 'foo', 3.141592, ['Alan', 'Dennis', 'Linus'],
        {'black': '#000000', 'white': '#ffffff'}]
data


Out[40]:
[1,
 'foo',
 3.141592,
 ['Alan', 'Dennis', 'Linus'],
 {'black': '#000000', 'white': '#ffffff'}]

Address PyObject
0320 ob_refcnt=1, ob_type=list, ob_size=3, ob_item=0510, allocated=16
... ...
0510 2478
0511 5601
0512 4882
... ...
2478 ob_refcnt=1, ob_type=int, ob_value=15
... ...
4882 ob_refcnt=1, ob_type=int, ob_value=2
... ...
5601 ob_refcnt=1, ob_type=int, ob_value=10
1) LOAD 0320
2) LOAD 0510
3) LOAD 2478 (ob_value) -> R1
4) LOAD 0511
5) LOAD 5601 (ob_value) -> R2
6) ADD R1 R2 -> R2
...

compare to:

1) LOAD 0320 -> R1   (R1=15, R2= ?)
2) LOAD 0321 -> R2   (R1=15, R2=10)
3) ADD R1 R2 -> R2   (R1=15, R2=25)

In [41]:
import pandas

pandas.Series(data)


Out[41]:
0                                           1
1                                         foo
2                                     3.14159
3                       [Alan, Dennis, Linus]
4    {'black': '#000000', 'white': '#ffffff'}
dtype: object

In [42]:
import pandas

series1 = pandas.Series(list1, dtype='object')
series2 = pandas.Series(list2, dtype='object')

%timeit (series1 > series2).mean()


868 ms ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]:
import numpy
import pandas

series1 = pandas.Series(list1, dtype=numpy.float64)
series2 = pandas.Series(list2, dtype=numpy.float64)

%timeit (series1 > series2).mean()


77.2 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas internal structure


In [44]:
import pandas

df = pandas.DataFrame({'foo': [1, 3, 7],
                       'bar': [.55, 1.76, 3.33],
                       'foobar': [109, 60, 13]},
                      columns=['foo', 'bar', 'foobar'])
print(df.dtypes)
df


foo         int64
bar       float64
foobar      int64
dtype: object
Out[44]:
foo bar foobar
0 1 0.55 109
1 3 1.76 60
2 7 3.33 13

In [45]:
df._data


Out[45]:
BlockManager
Items: Index(['foo', 'bar', 'foobar'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(1, 2, 1), 1 x 3, dtype: float64
IntBlock: slice(0, 4, 2), 2 x 3, dtype: int64

In [46]:
df._data.blocks


Out[46]:
(FloatBlock: slice(1, 2, 1), 1 x 3, dtype: float64,
 IntBlock: slice(0, 4, 2), 2 x 3, dtype: int64)

In [47]:
df._data.blocks[1].values


Out[47]:
array([[  1,   3,   7],
       [109,  60,  13]])

In [48]:
type(df._data.blocks[1].values)


Out[48]:
numpy.ndarray

In [49]:
df._data.blocks[1].values.data


Out[49]:
<memory at 0x7f0dcc65cc18>

In [50]:
df._data.blocks[1].values.nbytes


Out[50]:
48

In [51]:
bytes_ = df._data.blocks[1].values.tobytes()

print(''.join('{:08b}'.format(byte) for byte in bytes_))


000000010000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000000000000000000000000000000000000011100000000000000000000000000000000000000000000000000000000011011010000000000000000000000000000000000000000000000000000000000111100000000000000000000000000000000000000000000000000000000000000110100000000000000000000000000000000000000000000000000000000

In [52]:
df._data.blocks[1].values.strides


Out[52]:
(24, 8)

In [53]:
import numpy

data = numpy.empty((1000, 1000), numpy.float64)

%timeit data.sum(axis=0)
%timeit data.sum(axis=1)


799 µs ± 83.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
721 µs ± 79.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Memory copy

  • Read/write to memory is slow
  • When a operation (e.g. a filter) is performed, a copy or a view can be returned
    • pandas returns a view in few cases, but is not deterministic

In [54]:
import pandas

df = pandas.DataFrame({'foo': [1, 2, 3], 'bar': [5, 10, 15]})

df_view_or_copy = df[df.foo > 2]

df_view_or_copy['bar'] = 0  # are we modifying `df`? <- Chained indexing with unknown result: WARNING

df[df.foo > 2]['bar'] = 0  # same as before

df.loc[df.foo > 2, 'bar'] = 0  # we are modifying `df` (pandas manages __setitem__)

Future of pandas

  • BlockManager is tricky, get rid of it
  • Refactoring of types
  • numpy is a great backend in some cases, but add support for others too (e.g. Apache Arrow)
  • More consistency in missing values (add bitmaps to manage NaNs in any type
  • Native support of strings, not as object
  • Better management of memory
  • pandas in distributed systems (Dask, PySpark, Ibis...)

Join the pandas sprint tomorrow... :)

Random take aways

Pandas is really amazing

Use parquet, pickle... instead of csv when possible

Access to memory is key for performance

Data types are important (implications in memory, speed and possible values)

Avoid object dtype, and loops (PyObject) for optimal performance

Thank you

You can find me on the internet: @datapythonista